Skip to content

Convert Perl utf16 to utf8 functions to macros #23554

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: blead
Choose a base branch
from

Conversation

khwilliamson
Copy link
Contributor

These functions are hereby removed in favor of calling the plain macros that already exist

  • This set of changes does not require a perldelta entry.

These functions are hereby removed in favor of calling the plain macros
that already exist
Copy link

@bram-perl bram-perl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slightly wondering about the "why";
I guess what I'm really wondering about is why where they added as functions in the first place?

@khwilliamson
Copy link
Contributor Author

This wasn't feasible until 93f23f0. The commit message there is poorly worded. If the function did not have a thread context, it could be a macro. That commit allowed functions with a thread context to also be.

Basically no one had bothered (or been able to figure out how) to deal with macros whose first parameter, the thread context, isn't always there. The problem is that you need a comma separating the formal parameters in a macro definition. In a function definition the parameter can be pTHX_ expanding to nothing, but you can't say

#define foo(a_ b, c)

I've thought about this for a long time. There might be a way to do it if the header files are called recursively. Most header files have an #ifndef GUARD at the beginning to prevent recursive calling. I actually filed a ticket against K&R C to make the guard the default, but got no answer. Others told me it was probably because RItchie used it for some trickery, maybe some trojan horse like he put in the compiler. I've never sat down and tried to figure out how this might work to some advantage. And I've never seen an example of it being used.

The above commit enables this behavior for things listed in embed.fnc. Future commits I am envisioning will expand on this. It works by defining two cases, one if there are threads; and one without

@bulk88
Copy link
Contributor

bulk88 commented Aug 10, 2025

This commit would break CPAN unmodified source code compatibility, if the CPAN module authors decided to be risk takers and use unauthorized by P5P libperl features.

EMXp -> Emp conversion will remove these symbols from perl543.dll's sym table.

Name	Address	Ordinal
Perl_utf16_to_utf8	00000001800E79A4	1054
Perl_utf16_to_utf8_base	00000001800E7880	1055
Perl_utf16_to_utf8_reversed	00000001800E79C8	1056
Perl_utf8_to_utf16_base	00000001800E79EC	1060
U8*
Perl_utf16_to_utf8_reversed(pTHX_ U8* p, U8* d, Size_t bytelen, Size_t *newlen)
{
    PERL_ARGS_ASSERT_UTF16_TO_UTF8_REVERSED;

    return utf16_to_utf8_reversed(p, d, bytelen, newlen);
}

These APIs are very poorly designed and outright obfuscated.
I dont see a spelling difference or am I getting old?

Perl_utf16_to_utf8_reversed(pTHX_ U8* p, U8* d, Size_t bytelen, Size_t *newlen)
       utf16_to_utf8_reversed(                   p,        d,           bytelen,            newlen);

Time to open Ghidera to understand this JAPH GOLF contest.

char * Perl_utf16_to_utf8_reversed(interpreter *my_perl, char *p, char *d, unsigned __int64 bytelen, unsigned __int64 *newlen) {
  return Perl_utf16_to_utf8_base(my_perl, p, d, bytelen, newlen, 1, 0);
}

48 83 EC 48                sub     rsp, 48h
48 8B 44 24 70             mov     rax, [rsp+48h+newlen]
C6 44 24 30 00             mov     [rsp+48h+low_byte], 0 ; low_byte
C6 44 24 28 01             mov     [rsp+48h+high_byte], 1 ; high_byte
48 89 44 24 20             mov     [rsp+48h+var_28], rax ; newlen
E8 9B FE FF FF             call    Perl_utf16_to_utf8_base
48 83 C4 48                add     rsp, 48h
C3                         retn
                           Perl_utf16_to_utf8_reversed endp

Perl_utf16_to_utf8_reversed() broke Win64's redzone/regcall rules by having 5 arguments, the limit is 4 registers on Win64 x64, 6 on Linux x64. i386 Linux has zero register, ask SysV/ELF authors, i386 Win hypothetically has 2 using bizzare compiler options that only I know about, and I didn't like what I saw codegen wise in i386 WinPerl, so I never put that patch on RT. libperl.dll grew 12 kilobytes by flipping the CC's regcall switch.

Symbol's Perl_utf16_to_utf8_reversed prototype is very poor should have never landed in the repo in the first place.

The 5th arg is spaghetti/obfuscation.

perl5/utf8.c

Line 3372 in 07457a3

* 'p' is the UTF-16 input string, passed as a pointer to U8.

S_utf16_textfilter
Perl_utf16_to_utf8
Perl_utf16_to_utf8_reversed
.rdata:off_18030EC28	dd rva PL_EXACTFish_bitmask, rva PL_EXACT_REQ8_bitmask; PerlProcGetgid(IPerlProc const * *) ...

3 callers+1 exp sym tab entry at -O1 with LTO. Since its an extern sym, LTO is illegal to do on it. C is C, ABI is ABI.

char * Perl_utf16_to_utf8_base(interpreter *my_perl, char *p, char *d, unsigned __int64 bytelen, unsigned __int64 *newlen, const bool high_byte, const bool low_byte)

7 arguments!!!!

 1, 0)

128 bits of storage were used to encode 2 bits of information. Or in metric, 16 bytes of storage, were used to encode 1 byte of information.

U8, U16, U32 are very nice Perl C API types and all 3 are a better idea than type U128. Perl is not written in Rust, Go, or Javascript. It is written in C.

ISO C looks at these 3 C features, unsigned int u = 'strn'; and int bitfields :1; and enum as if they are 4 star Germany Army Generals in the year 1946. Very very sad in my SW engineering opinion. Those 3 C grammar features, on the assembly code level, have been fake-ed by the Perl C API from 1.0-5.43.0-5.Inf.0. if (SvFLAGS(sv) & SVf_BLAH) {.

Now lets keep deciphering this unreadable character conversion algorithm.

/*
 * Convert UTF-8 to native UTF-16. Called via the macros utf8_to_utf16() for
 * big-endian and utf8_to_utf16_reversed() for little-endian,
 *
 * 's' is the UTF-8 input string, passed as a pointer to U8.
 * 'bytelen' is its length
 * 'd' is the pointer to the destination buffer, currently passed as U8 *.  The
 *     caller must ensure that the space is large enough.  The maximum
 *     expansion factor is 2 times 'bytelen'.  This happens when the input is
 *     entirely single-byte ASCII, expanding to two-byte UTF-16.
 * '*newlen' will contain the number of bytes this function filled of 'd'.
 * 'high_byte' is 0 if UTF-16BE; 1 if UTF-16LE
 * 'low_byte'  is 1 if UTF-16BE; 0 if UTF-16LE

Why isn't the retval documented? I see a U8* not void.

 * Do not use in-place. */
U8*
Perl_utf8_to_utf16_base(pTHX_ U8* s, U8* d, Size_t bytelen, Size_t *newlen,
                              const bool high_byte, /* Which of next two bytes
                                                       is high order */
                              const bool low_byte)
{
    U8* send;
    U8* dstart = d;

    PERL_ARGS_ASSERT_UTF8_TO_UTF16_BASE;

    send = s + bytelen;

Actually a good practice. Don't use int i, advance your pointer in units of 2/4/8, using a simple +0x0LITERAL in asm/Asm-in-C, not ptr = ptr +( i << 3); I see 2 math operators in ptr = ptr +( i << 4);, I see 1 math operator in ptr = ptr + 4;.


    while (s < send) {
        STRLEN retlen;
        UV uv;
        (void) c9strict_utf8_to_uv(s, send, &uv, &retlen);

Why do we have a free retval cpu register we aren't using?

Again, poor prototype layout. &uv and &retlen are now declared with ISO C's volatile token, since they escaped this function into another function, and can never live again inside a CPU register. A C compiler can never determine if linker symbol c9strict_utf8_to_uv saved a copy of mem addr &uv, and &retlen for long term storage, possible being read/written to continously on another thread on another CPU core. Perl's SvPV() macro also has this problem, I actually have a branch to fix once and forever somewhere in a RT ticket.

SysV ABI .pdf guarentees linux's C ABI has a U128 retval register on AMD64 CPUs.

MS's i386 ABI has a U64 register retval guaranteed per ABI .pdf.

That will hold and transport Perl's char* and STRLEN vars from Perl_sv_2pvn_flags() symbol back to the caller side by side very very efficiently. Perl C API doesn't know what a pass by copy struct return type is, that is very sad.

MS's Win64 ABI .pdf is brain damaged, and retval is a U64 register on AMD 64 arch. @bulk88 has convinced MSVC Clang and GCC to give Perl C linker fn symbols compiled Win64 on AMD64. There is a secret loop hole in the Win64 for AMD64 ABI document that will give me a U128 integer retval type in MS-flavored ISC C land on MSVC/GCC/Clang ;-)

And if there is a (void) on the left side in most place in Perl VM C code, GCC or Clang is probably saying something not smart is going on.


        if (uv >= FIRST_IN_PLANE1) {    /* Requires a surrogate pair */

            /* From https://unicode.org/faq/utf_bom.html#utf16-4 */
            U32 high_surrogate = (uv >> 10) - (FIRST_IN_PLANE1 >> 10)
                               + FIRST_HIGH_SURROGATE;

            /* (The bool is cast to U8 because on platforms where a bool is
             * implemented as a signed char, a compiler warning may be
             * generated) */
            d[(U8) high_byte] = high_surrogate >> 8;
            d[(U8) low_byte]  = high_surrogate & nBIT_MASK(8);

looks like a branchless algo, good. I'm not happy about the unaligned U16* paranoia here, but Perl 5's UA memory read/write macros (macro infrastructure) in 5.43's git repo, its there, nobody but demerphic ever cared about P5's UA mem macros working, and the macros are and easter egg in the P5P .git repo right now. I should really donate my day job's headers to the P5 repo for UA OS safe mem R/Ws.

            d += 2;

            /* The low surrogate is the lower 10 bits plus the offset */
            uv &= nBIT_MASK(10);
            uv += FIRST_LOW_SURROGATE;

            /* Drop down to output the low surrogate like it were a
             * non-surrogate */
        }

        d[(U8) high_byte] = uv >> 8;
        d[(U8) low_byte] = uv & nBIT_MASK(8);
        d += 2;

        s += retlen;
    }

    *newlen = d - dstart;
    return d;
}

the last 2 lines make no sense, Either return the buffer length/array length, as the retval, or return ptr to last byte or ptr to last byte +1, Why is this function returning imperial units in a outgoing data size_t * arg, and returning metric units in the fn's retval. Same exact piece of info.

Next stage of decrypting this code, ^^^^ isn't C code. |/|/|/|/|/|/ below is C code.

U8 *Perl_utf8_to_utf16_base(PerlInterpreter *my_perl, U8 *s, U8 *d, size_t bytelen, size_t *newlen,
                            const _Bool high_byte, const _Bool low_byte)
{
    U8 *send;
    U8 *dstart = d;
    ((void)0);
    ((void)0);
    ((void)0);
    send = s + bytelen;
    while (s < send)
    {
        STRLEN retlen;
        UV uv;
        (void)Perl_utf8_to_uv_msgs(s, send, &uv, &retlen, (0x0200 | 0x0080 | 0x0020), 0, 0);
        if (uv >= 0x10000)
        {
            U32 high_surrogate = (uv >> 10) - (0x10000 >> 10) + 0xD800;
            d[(U8)high_byte] = high_surrogate >> 8;
            d[(U8)low_byte] = high_surrogate & (((1ULL) << (8)) - 1);

I have a trick using U32*s and vectoring this to do 2 at a time UTF16 chrs over 4 bytes, down to 2 bytes/2 code points in a U16* L1/UTF low 7bit trick. MS Win7-Win11 uses it. Glibc probably does to if I check. 98% of UTF16 byte streams that exist right now on planet earth as sin waves on copper are 7 bit clean ASCII ya know ;-) Just lets not talk about Japan. Their Ministry of Commerce/Academy of Sciences has never used "lower 7 bit bit clean ASCII" in the history of that country. Its some to do with $$$ or the button u press on a QWERTY keyboard for $$$ jkjk.

            d += 2;
            uv &= (((1ULL) << (10)) - 1);
            uv += 0xDC00;
        }
        d[(U8)high_byte] = uv >> 8;
        d[(U8)low_byte] = uv & (((1ULL) << (8)) - 1);

I hope this is getting constant folded away by all 3 big C compilers. But we are in the function body of an extern "C" linker symbol remember.

        d += 2;
        s += retlen;
    }
    *newlen = d - dstart;
    return d;
}

Now lets send this C code through a -O1/-O2 C compiler, and figure out what it does in real life. This is MSVC 2022 -O1 LTO on.

char * Perl_utf16_to_utf8_base(interpreter *my_perl, char *p, char *d, unsigned __int64 bytelen, unsigned __int64 *newlen, const bool high_byte, const bool low_byte)
{
  char *pend;
  int low_byte2;
  int high_byte2;
  unsigned int uv;
  unsigned int low_surrogate;
  char *msg;

  if ( bytelen & 1 )
  {
    msg = "_reversed";
    if ( !high_byte )
      msg = empty_string;
    Perl_die(my_perl, "panic: utf16_to_utf8%s: odd bytelen %I64u", msg);
  }
  for ( pend = &p[bytelen]; p < pend; d = Perl_uvoffuni_to_utf8_flags_msgs(my_perl, d, uv, 0i64, 0i64) )

WHAT IS THAT 5 ARGUMENT function call doing in this hottest control flow path of this for() loop!!!!!!!

WHY ARE THERE 2 0x000000 const literals being written to the C stack? on each iteration of this loop?

This is a disaster.

  {
    low_byte2 = p[low_byte];
    high_byte2 = p[high_byte];
    p += 2;
    uv = low_byte2 + (high_byte2 << 8);
    if ( uv - 0xD800 <= 0x7FF )
    {
      if ( p >= pend
        || uv > 0xDBFF
        || (low_surrogate = p[low_byte] + (p[high_byte] << 8), low_surrogate - 0xDC00 > 0x3FF) )
      {
        Perl_die(my_perl, "Malformed UTF-16 surrogate");

!@#$%##$% March 2025 PR I didn't comment in time/didn't see in time, that degraded ithread perl on every platform
I haven't gotten around to a commit to reverse it, since I don't want to just revert 7 commits in a row, I'd rather use CPP macro tools to fix the problem invisibly especially with the new C99 VA_ARGS macro feature, but I got stuck on the branch last year with #ifdef MULTIPLICITY #else #endif logic to fix the last 20% invisibly of the useless push my_perl ptr onto C stack for the rarest executed function in all of Perl.

      }
      p += 2;
      uv = low_surrogate + (uv << 10) - 0x35FDC00;
    }
  }
  *newlen = d - d;
  return d;
}

Demo that I as a CPAN author, can [unauthorized] use the symbols that are proposed to be deleted by this commit.

static void S_u16t(pTHX_ CV* cv) {
    dXSARGS;

    if(items != 2)
	croak_xs_usage(cv, "u8buf, u2buf");
    SV* u2 = TOPs;
    SV* u1 = TOPs;
	STRLEN l;
	char * pv = SvPV(u1, l);
	char * out = sv_grow(u2, l*4);
	STRLEN o;
    utf16_to_utf8_reversed(pv, out, l, &o);
	SvCUR_set(u2, o);
	SvPOK_only(u2);
	PUTBACK;
    return;
}
"C:\pb64\bin\perl.exe" "C:\pb64\lib\ExtUtils/xsubpp"  -typemap "C:\pb64\lib\ExtU
tils\typemap"   khwutf16removal_pl_f1a6.xs > khwutf16removal_pl_f1a6.xsc
"C:\pb64\bin\perl.exe" -MExtUtils::Command -e mv -- khwutf16removal_pl_f1a6.xsc
khwutf16removal_pl_f1a6.c
cl -c   -nologo -GF -W3 -MD -TC -DWIN32 -D_CONSOLE -DNO_STRICT -DWIN64 -D_CRT_SE
CURE_NO_DEPRECATE -D_CRT_NONSTDC_NO_DEPRECATE -D_WINSOCK_DEPRECATED_NO_WARNINGS
-DPERL_TEXTMODE_SCRIPTS -DMULTIPLICITY -DPERL_IMPLICIT_SYS -DWIN32_NO_REGISTRY -
DUSE_PERLIO -O1 -Zi -GL -fp:precise   -DVERSION=\"0.00\" -DXS_VERSION=\"0.00\"
"-IC:\pb64\lib\CORE"   -Fdkhwutf16removal_pl_f1a6.pdb khwutf16removal_pl_f1a6.c
khwutf16removal_pl_f1a6.c
C:\pb64\lib\CORE\inline.h(1148): warning C4018: '<=': signed/unsigned mismatch
khwutf16removal_pl_f1a6.xs(123): warning C4133: 'function': incompatible types -
 from 'CV *' to 'SV *'
khwutf16removal_pl_f1a6.xs(124): warning C4133: ':': incompatible types - from '
SV *' to 'CV *'
khwutf16removal_pl_f1a6.xs(124): warning C4133: 'initializing': incompatible typ
es - from 'CV *' to 'SV *'
khwutf16removal_pl_f1a6.xs(148): warning C4133: 'function': incompatible types -
 from 'WCHAR *' to 'const char *const '
khwutf16removal_pl_f1a6.xs(170): warning C4267: '=': conversion from 'size_t' to
 'USHORT', possible loss of data
khwutf16removal_pl_f1a6.xs(172): warning C4090: '=': different 'const' qualifier
s
khwutf16removal_pl_f1a6.xs(225): warning C4013: 'utf16_to_utf8_base' undefined;
assuming extern returning int
"C:\pb64\bin\perl.exe" -MExtUtils::Mksymlists \
     -e "Mksymlists('NAME'=>\"khwutf16removal_pl_f1a6\", 'DLBASE' => 'khwutf16re
moval_pl_f1a6', 'DL_FUNCS' => {  }, 'FUNCLIST' => [], 'IMPORTS' => {  }, 'DL_VAR
S' => []);"
link -out:blib\arch\auto\khwutf16removal_pl_f1a6\khwutf16removal_pl_f1a6.dll -dl
l -nologo -nodefaultlib -debug -opt:ref,icf -ltcg -libpath:"c:\pb64\lib\CORE" -m
achine:AMD64 -subsystem:console,"5.02" khwutf16removal_pl_f1a6.obj   "C:\pb64\li
b\CORE\perl541.lib" oldnames.lib kernel32.lib user32.lib gdi32.lib winspool.lib
comdlg32.lib advapi32.lib shell32.lib ole32.lib oleaut32.lib netapi32.lib uuid.l
ib ws2_32.lib mpr.lib winmm.lib version.lib odbc32.lib odbccp32.lib comctl32.lib
 msvcrt.lib vcruntime.lib ucrt.lib -def:khwutf16removal_pl_f1a6.def
   Creating library blib\arch\auto\khwutf16removal_pl_f1a6\khwutf16removal_pl_f1
a6.lib and object blib\arch\auto\khwutf16removal_pl_f1a6\khwutf16removal_pl_f1a6
.exp
khwutf16removal_pl_f1a6.obj : error LNK2001: unresolved external symbol utf16_to
_utf8_base
blib\arch\auto\khwutf16removal_pl_f1a6\khwutf16removal_pl_f1a6.dll : fatal error
 LNK1120: 1 unresolved externals
gmake: *** [Makefile:471: blib\arch\auto\khwutf16removal_pl_f1a6\khwutf16removal
_pl_f1a6.dll] Error 1120

Defeated, gotta use the long name trick.

use Devel::Peek;
 #use Inline  ('force', 'noclean');

use Inline C => Config =>
 PRE_HEAD => '#define PERL_NO_GET_CONTEXT 1
 #define atowide() S_atowide(aTHX_ cv)
 
 ';
use Inline C => Config => BUILD_NOISY => 1;
use Inline C => Config => PREFIX => 'MXS';

use Inline C => <<'END_OF_C_CODE';


static void S_u16t(pTHX_ CV* cv);

void BOOT1() {
    dTHX;
	newXS( "u16t", S_u16t, __FILE__);
	return;
}

static void S_u16t(pTHX_ CV* cv) {
    dXSARGS;

    if(items != 2)
	croak_xs_usage(cv, "u8buf, u2buf");
    SV* u2 = POPs;
    SV* u1 = POPs;
	STRLEN l;
	sv_dump(u1);
	sv_dump(u2);
	char * pv = SvPV(u1, l);
	char * out = sv_grow(u2, l*4);
	STRLEN o = l*4;
    Perl_utf16_to_utf8_reversed(aTHX_ pv, out, l, &o);
	SvCUR_set(u2, o);
	SvPOK_only(u2);
	sv_dump(u1);
	sv_dump(u2);
	PUTBACK;
    return;
}

END_OF_C_CODE


$, = "\n";

#use v5.30;
#use strict;
#use warnings;
#$DB::single = 1;
my $s4;
#$DB::single = 1;

use Encode;
use Benchmark qw(:all);
use Devel::Peek 'Dump';


BOOT1();

my $out;
my $in ="a\0b\0c\0d\0\0\0";
u16t($in, $out);
#Dump($out);
exit;

  Finished "make install" Stage

  Starting Cleaning Up Stage
  Finished Cleaning Up Stage

Finished Build Compile Stage

SV = PV(0x3add88) at 0x2732280
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x26861b8 "a\x00b\x00c\x00d\x00\x00\x00"\0
  CUR = 10
  LEN = 16
SV = NULL(0x0) at 0x3f1898
  REFCNT = 1
  FLAGS = ()
SV = PV(0x3add88) at 0x2732280
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x26861b8 "a\x00b\x00c\x00d\x00\x00\x00"\0
  CUR = 10
  LEN = 16
SV = PV(0x3addd8) at 0x3f1898
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0x3ddfd8 "abcd\x00"
  CUR = 5
  LEN = 41

C:\sources>

I win the game. There was no prize in this game. Its just a magic trick.

I can always copy paste whatever I want from the repo anyways as a CPAN author, a trick another CPAN dev more experienced than me told is his solution to all private internal APIs that P5P doesn't want to stick in perlapi.pod for legit technical reasons, or political reasons.

For research/archive purposes, here is a full asm dump of what this function does at -O1

  3417: U8*
  3418: Perl_utf16_to_utf8_base(pTHX_ U8* p, U8* d, Size_t bytelen, Size_t *newlen,
  3419:                               const bool high_byte, /* Which of next two bytes is
  3420:                                                   high order */
  3421:                               const bool low_byte)
  3422: {
000007FEA7087880 48 89 5C 24 08       mov         qword ptr [rsp+8],rbx  
000007FEA7087885 48 89 6C 24 10       mov         qword ptr [rsp+10h],rbp  
000007FEA708788A 48 89 74 24 18       mov         qword ptr [rsp+18h],rsi  
000007FEA708788F 57                   push        rdi  
000007FEA7087890 41 56                push        r14  
000007FEA7087892 41 57                push        r15  
000007FEA7087894 48 83 EC 30          sub         rsp,30h  
000007FEA7087898 4D 8B D0             mov         r10,r8  
000007FEA708789B 48 8B DA             mov         rbx,rdx  
000007FEA708789E 48 8B F1             mov         rsi,rcx  
  3423:     U8* pend;
  3424:     U8* dstart = d;
000007FEA70878A1 4D 8B F8             mov         r15,r8  
  3425: 
  3426:     PERL_ARGS_ASSERT_UTF16_TO_UTF8_BASE;
  3427: 
  3428:     if (bytelen & 1)
000007FEA70878A4 41 F6 C1 01          test        r9b,1  
000007FEA70878A8 0F 85 D0 00 00 00    jne         Perl_utf16_to_utf8_base+0FEh (07FEA708797Eh)  
  3430:                 ((high_byte == 0) ? "" : "_reversed"), (UV)bytelen);
  3431:     pend = p + bytelen;
000007FEA70878AE 4A 8D 3C 0A          lea         rdi,[rdx+r9]  
  3432: 
  3433:     while (p < pend) {
000007FEA70878B2 48 3B D7             cmp         rdx,rdi  
000007FEA70878B5 0F 83 89 00 00 00    jae         Perl_utf16_to_utf8_base+0C4h (07FEA7087944h)  
  3430:                 ((high_byte == 0) ? "" : "_reversed"), (UV)bytelen);
  3431:     pend = p + bytelen;
000007FEA70878BB 0F B6 6C 24 78       movzx       ebp,byte ptr [high_byte]  
000007FEA70878C0 44 0F B6 B4 24 80 00 00 00 movzx       r14d,byte ptr [low_byte]  
  3434: 
  3435:         /* Next 16 bits is what we want.  (The bool is cast to U8 because on
  3436:          * platforms where a bool is implemented as a signed char, a compiler
  3437:          * warning may be generated) */
  3438:         U32 uv = (p[(U8) high_byte] << 8) + p[(U8) low_byte];
000007FEA70878C9 41 0F B6 04 1E       movzx       eax,byte ptr [r14+rbx]  
000007FEA70878CE 0F B6 0C 2B          movzx       ecx,byte ptr [rbx+rbp]  
  3439:         p += 2;
000007FEA70878D2 48 83 C3 02          add         rbx,2  
000007FEA70878D6 C1 E1 08             shl         ecx,8  
000007FEA70878D9 03 C8                add         ecx,eax  
  3440: 
  3441:         /* If it's a surrogate, we find the uv that the surrogate pair encodes.
  3442:          * */
  3443:         if (UNLIKELY(UNICODE_IS_SURROGATE(uv))) {
000007FEA70878DB 8B C1                mov         eax,ecx  
000007FEA70878DD 48 2D 00 D8 00 00    sub         rax,0D800h  
000007FEA70878E3 48 3D FF 07 00 00    cmp         rax,7FFh  
000007FEA70878E9 77 3A                ja          Perl_utf16_to_utf8_base+0A5h (07FEA7087925h)  
  3444: 
  3445: #define FIRST_HIGH_SURROGATE UNICODE_SURROGATE_FIRST
  3446: #define LAST_HIGH_SURROGATE  0xDBFF
  3447: #define FIRST_LOW_SURROGATE  0xDC00
  3448: #define LAST_LOW_SURROGATE   UNICODE_SURROGATE_LAST
  3449: #define FIRST_IN_PLANE1      0x10000
  3450: 
  3451:             if (UNLIKELY(p >= pend) || UNLIKELY(uv > LAST_HIGH_SURROGATE)) {
000007FEA70878EB 48 3B DF             cmp         rbx,rdi  
000007FEA70878EE 73 7E                jae         Perl_utf16_to_utf8_base+0EEh (07FEA708796Eh)  
000007FEA70878F0 81 F9 FF DB 00 00    cmp         ecx,0DBFFh  
000007FEA70878F6 77 76                ja          Perl_utf16_to_utf8_base+0EEh (07FEA708796Eh)  
  3453:             }
  3454:             else {
  3455:                 U32 low_surrogate = (p[(U8) high_byte] << 8) + p[(U8) low_byte];
000007FEA70878F8 41 0F B6 04 1E       movzx       eax,byte ptr [r14+rbx]  
000007FEA70878FD 0F B6 14 2B          movzx       edx,byte ptr [rbx+rbp]  
000007FEA7087901 C1 E2 08             shl         edx,8  
000007FEA7087904 03 D0                add         edx,eax  
  3456:                 if (UNLIKELY(! inRANGE(low_surrogate, FIRST_LOW_SURROGATE,
000007FEA7087906 8B C2                mov         eax,edx  
000007FEA7087908 48 2D 00 DC 00 00    sub         rax,0DC00h  
000007FEA708790E 48 3D FF 03 00 00    cmp         rax,3FFh  
000007FEA7087914 77 58                ja          Perl_utf16_to_utf8_base+0EEh (07FEA708796Eh)  
  3457:                                                        LAST_LOW_SURROGATE)))
  3458:                 {
  3459:                     croak("Malformed UTF-16 surrogate");
  3460:                 }
  3461: 
  3462:                 p += 2;
  3463: 
  3464:                 /* Here uv is the high surrogate.  Combine with low surrogate
  3465:                  * just computed to form the actual U32 code point.
  3466:                  *
  3467:                  * From https://unicode.org/faq/utf_bom.html#utf16-4 */
  3468:                 uv = FIRST_IN_PLANE1 + (uv << 10) - (FIRST_HIGH_SURROGATE << 10)
000007FEA7087916 C1 E1 0A             shl         ecx,0Ah  
000007FEA7087919 48 83 C3 02          add         rbx,2  
000007FEA708791D 81 C1 00 24 A0 FC    add         ecx,0FCA02400h  
000007FEA7087923 03 CA                add         ecx,edx  
  3469:                                      + low_surrogate - FIRST_LOW_SURROGATE;
  3470:             }
  3471:         }
  3472: 
  3473:         /* Here, 'uv' is the real U32 we want to find the UTF-8 of */
  3474:         d = uv_to_utf8(d, uv);
000007FEA7087925 48 83 64 24 20 00    and         qword ptr [rsp+20h],0  
000007FEA708792B 45 33 C9             xor         r9d,r9d  
000007FEA708792E 44 8B C1             mov         r8d,ecx  
000007FEA7087931 49 8B D2             mov         rdx,r10  
000007FEA7087934 48 8B CE             mov         rcx,rsi  
000007FEA7087937 E8 D4 DD FF FF       call        Perl_uvoffuni_to_utf8_flags_msgs (07FEA7085710h)  
000007FEA708793C 4C 8B D0             mov         r10,rax  
  3432: 
  3433:     while (p < pend) {
000007FEA708793F 48 3B DF             cmp         rbx,rdi  
000007FEA7087942 72 85                jb          Perl_utf16_to_utf8_base+49h (07FEA70878C9h)  
  3475:     }
  3476: 
  3477:     *newlen = d - dstart;
000007FEA7087944 48 8B 44 24 70       mov         rax,qword ptr [newlen]  
000007FEA7087949 49 8B CA             mov         rcx,r10  
  3478:     return d;
  3479: }
000007FEA708794C 48 8B 5C 24 50       mov         rbx,qword ptr [rsp+50h]  
000007FEA7087951 49 2B CF             sub         rcx,r15  
000007FEA7087954 48 8B 6C 24 58       mov         rbp,qword ptr [rsp+58h]  
000007FEA7087959 48 8B 74 24 60       mov         rsi,qword ptr [rsp+60h]  
000007FEA708795E 48 89 08             mov         qword ptr [rax],rcx  
000007FEA7087961 49 8B C2             mov         rax,r10  
000007FEA7087964 48 83 C4 30          add         rsp,30h  
000007FEA7087968 41 5F                pop         r15  
000007FEA708796A 41 5E                pop         r14  
000007FEA708796C 5F                   pop         rdi  
000007FEA708796D C3                   ret  
  3452:                 croak("Malformed UTF-16 surrogate");
000007FEA708796E 48 8D 15 63 8E 21 00 lea         rdx,[string "Malformed UTF-16 surrogate" (07FEA72A07D8h)]  
000007FEA7087975 48 8B CE             mov         rcx,rsi  
000007FEA7087978 E8 57 7B 05 00       call        Perl_croak (07FEA70DF4D4h)  
000007FEA708797D CC                   int         3  
  3429:         croak("panic: utf16_to_utf8%s: odd bytelen %" UVuf,
000007FEA708797E 80 7C 24 78 00       cmp         byte ptr [high_byte],0  
000007FEA7087983 48 8D 05 76 32 07 00 lea         rax,[string "" (07FEA70FAC00h)]  
000007FEA708798A 4C 8D 05 07 8E 21 00 lea         r8,[string "_reversed" (07FEA72A0798h)]  
000007FEA7087991 4C 0F 44 C0          cmove       r8,rax  
000007FEA7087995 48 8D 15 0C 8E 21 00 lea         rdx,[string "panic: utf16_to_utf8%s: odd byt@"... (07FEA72A07A8h)]  
000007FEA708799C E8 33 7B 05 00       call        Perl_croak (07FEA70DF4D4h)  
000007FEA70879A1 CC                   int         3  
--- No source file -------------------------------------------------------------
000007FEA70879A2 CC                   int         3  
000007FEA70879A3 CC                   int         3  

I have to goto work, i'm not reverse engineering what symbol

000007FEA7087925 48 83 64 24 20 00    and         qword ptr [rsp+20h],0  
000007FEA708792B 45 33 C9             xor         r9d,r9d  
000007FEA708792E 44 8B C1             mov         r8d,ecx  
000007FEA7087931 49 8B D2             mov         rdx,r10  
000007FEA7087934 48 8B CE             mov         rcx,rsi  
000007FEA7087937 E8 D4 DD FF FF       call        Perl_uvoffuni_to_utf8_flags_msgs (07FEA7085710h)  
000007FEA708793C 4C 8B D0             mov         r10,rax  

is doing with its 7 c stack arguments in a super hot loop.

Best solution for all the issues I described is just to delete all of these functions/lines of code, and copy paste something from another tried and tested and trusted with a large user base BSD licensed FOSS software project and not think too much more into this API feature, Perl's attempt at reinventing the wheel made the wheel a triangle .

@bulk88
Copy link
Contributor

bulk88 commented Aug 10, 2025

dictionary.bin is this ~120 kb file https://github.com/google/brotli/blob/master/c/common/dictionary.bin

use Devel::Peek;
use v5.30;
 #use Inline  ('force', 'noclean');

use Inline C => Config =>
 PRE_HEAD => '#define PERL_NO_GET_CONTEXT 1

 ';
use Inline C => Config => BUILD_NOISY => 1;
use Inline C => Config => PREFIX => 'MXS';

use Inline C => <<'END_OF_C_CODE';

typedef DWORD (__stdcall* pRtlUnicodeToUTF8N_t)(
PCHAR  UTF8StringDestination,
ULONG  UTF8StringMaxByteCount,
PULONG UTF8StringActualByteCount,
PCWSTR UnicodeStringSource,
ULONG  UnicodeStringByteCount
);

pRtlUnicodeToUTF8N_t pRtlUnicodeToUTF8N;
static void S_u16t(pTHX_ CV* cv);
static void S_u16mst(pTHX_ CV* cv);

void BOOT1() {
    dTHX;
    HANDLE h = GetModuleHandle("ntdll");
    pRtlUnicodeToUTF8N = (pRtlUnicodeToUTF8N_t)GetProcAddress(h, "RtlUnicodeToUTF8N");
	newXS( "u16t", S_u16t, __FILE__);
	newXS( "u16mst", S_u16mst, __FILE__);
	return;
}

static void S_u16t(pTHX_ CV* cv) {
    dXSARGS;

    if(items != 2)
	croak_xs_usage(cv, "u8buf, u2buf");
    SV* u2 = POPs;
    SV* u1 = POPs;
	STRLEN l;
	char * pv = SvPV(u1, l);
	char * out = SvGROW(u2, l*4);
	STRLEN o = l*4;
    Perl_utf16_to_utf8_reversed(aTHX_ pv, out, l, &o);
	SvCUR_set(u2, o);
	SvPOK_only(u2);
	PUTBACK;
    return;
}

static void S_u16mst(pTHX_ CV* cv) {
    dXSARGS;

    if(items != 2)
	croak_xs_usage(cv, "u8buf, u2buf");
    SV* u2 = POPs;
    SV* u1 = POPs;
	STRLEN l;
	//sv_dump(u1);
	//sv_dump(u2);
	char * pv = SvPV(u1, l);
	char * out = SvGROW(u2, l*4);
	ULONG rlen;
	ULONG err = pRtlUnicodeToUTF8N(out, l*4, &rlen, pv,l);
	if(err)
		croak("fsyscall %u", err);
	SvCUR_set(u2, rlen);
	SvPOK_only(u2);
	PUTBACK;
    return;
}
END_OF_C_CODE


$, = "\n";

#use v5.30;
#use strict;
#use warnings;
#$DB::single = 1;
my $s4;
#$DB::single = 1;

use Encode;
use Benchmark qw(:all);
use Devel::Peek 'Dump';


BOOT1();

my $_UTF16 ||= Encode::find_encoding ('utf16-le');

my $f;
open($f, '<:utf8:raw', 'dictionary.bin');
#binmode($f, ":utf8");
{
	undef($,);
	my $d = <$f>;
	say(length($d));
	utf8::upgrade($d);
	my $u16 = Encode::encode(
	$_UTF16,
	$d);
	my $in = "\0" x length($u16);
	my $out = "\0" x (length($u16)*3);
	$in = $u16;
	sub pl { u16t($in, $out); $out = "\0" x length($out); }
	sub ms { u16mst($in, $out); $out = "\0" x length($out); }
	pl();
	ms();
	cmpthese(900000, {
	'pl' => \&pl,
	'ms' => \&ms,
	'ppl' => \&pl,
	'mms' => \&ms,
	});
}
#Dump($out);
exit;

  Starting Cleaning Up Stage
  Finished Cleaning Up Stage

Finished Build Compile Stage

1752
        Rate   pl  ppl   ms  mms
pl   77439/s   --  -2% -85% -85%
ppl  78920/s   2%   -- -84% -85%
ms  506187/s 554% 541%   --  -3%
mms 519630/s 571% 558%   3%   --

C:\sources>

A professionally written algorithm is 5.5x faster than Perl's C code. Everyone should start searching GH for BSD software written in C and find something to copy paste into perl.

@bulk88
Copy link
Contributor

bulk88 commented Aug 10, 2025

there was a bug above in the benchmark, it was only looping 1.7 kb, undef($/); was the fix, on the full data set, MS's new 2000s era UTF8 functions, not-prehistoric, not-delivered in a monthly .zip/.cab/floppy security update, and not from the early 80s, new Unicode API on >= Vista, its actually 8.3x faster than perls algo.

C:\sources>perl khwutf16removal.pl
122784
       Rate   pl  ppl  mms   ms
pl   1107/s   --  -2% -89% -89%
ppl  1127/s   2%   -- -89% -89%
mms 10297/s 830% 814%   --   0%
ms  10297/s 830% 814%   0%   --

C:\sources>

Someone should try whatever the typical .soes are for doing this on android/apple/bsd/GPL3 land in a benchmark against symbol Perl_utf16_to_utf8_reversed(aTHX_ pv, out, l, &o);. Ive heard the words iconv and gconv before. and iconv is a gutted src code compat layer to the next gen GNU gconv lib or something. Its been a few years since I looked at it, I dont remember anything about it anymore. Android/Apple/Freebsd will have their own independent copycats of course since GPL3 is illegal for Android/Apple platforms to use, so newlib muscl and whatever else they use will have their own unique tricks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants